Import the necessary libraries

In [169]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
%matplotlib inline

1. Data pre-processing – Perform all the necessary preprocessing on the data ready to be fed to an Unsupervised algorithm

In [170]:
#reading the vehicle data
vechiledata = pd.read_csv(r'D:\SAI\MECH\Great Learning\Study\UNSupervisedLearning\Project\vehicledata.csv')
vechiledata.head()
Out[170]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [171]:
#Understanding shape of data
vechiledata.shape
Out[171]:
(846, 19)

The dataset has 846 rows and 19 columns.

In [172]:
#get detailed information
vechiledata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   compactness                  846 non-null    int64  
 1   circularity                  841 non-null    float64
 2   distance_circularity         842 non-null    float64
 3   radius_ratio                 840 non-null    float64
 4   pr.axis_aspect_ratio         844 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64  
 6   scatter_ratio                845 non-null    float64
 7   elongatedness                845 non-null    float64
 8   pr.axis_rectangularity       843 non-null    float64
 9   max.length_rectangularity    846 non-null    int64  
 10  scaled_variance              843 non-null    float64
 11  scaled_variance.1            844 non-null    float64
 12  scaled_radius_of_gyration    844 non-null    float64
 13  scaled_radius_of_gyration.1  842 non-null    float64
 14  skewness_about               840 non-null    float64
 15  skewness_about.1             845 non-null    float64
 16  skewness_about.2             845 non-null    float64
 17  hollows_ratio                846 non-null    int64  
 18  class                        846 non-null    object 
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB

All attributes are numeric (int or float) except class, which is categorical.

In [173]:
#the class attribute is categorical, so convert it from object to category
vechiledata['class']=vechiledata['class'].astype('category')
In [174]:
#get the summary of data
vechiledata.describe().transpose()
Out[174]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0
In [175]:
#Checking for missing values in the dataset
vechiledata.isnull().sum()
Out[175]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64

Some attributes have missing values; we replace them with the column median.

In [176]:
#replace blank-cell placeholders with NaN so pandas treats them as missing
vechiledata = vechiledata.replace(' ', np.nan)
In [177]:
#Replacing the missing values with the column median (all 18 numeric columns)
for i in vechiledata.columns[:18]:
    median_value = vechiledata[i].median()
    vechiledata[i] = vechiledata[i].fillna(median_value)
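The column-by-column loop can also be written as a single vectorized call. A minimal sketch on a small hypothetical frame (the column names and values are illustrative, not rows from the vehicle data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two numeric columns of the dataset
df = pd.DataFrame({'circularity': [48.0, np.nan, 50.0, 41.0],
                   'radius_ratio': [178.0, 141.0, np.nan, 159.0]})

# fillna accepts a Series of per-column medians, so every numeric
# column is imputed in one call instead of a loop
df = df.fillna(df.median(numeric_only=True))
print(df.isnull().sum().sum())
```

Both forms give identical results; the vectorized version just avoids repeated column lookups.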
In [178]:
# Again checking for missing values in the dataset
vechiledata.isnull().sum()
Out[178]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64

2. Understanding the attributes - Find relationship between different attributes (Independent variables) and choose carefully which all attributes have to be a part of the analysis and why

In [179]:
#histograms of all the attributes
vechiledata.hist(figsize=(15,15));
In [180]:
# A quick check to find columns that contain outliers
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(data = vechiledata.iloc[:, 0:18], orient = 'h')
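The boxplots mark outliers with the usual 1.5×IQR whisker rule, and the same check can be done numerically. A sketch on a hypothetical series (values are illustrative, not from the dataset):

```python
import pandas as pd

s = pd.Series([7, 8, 8, 9, 10, 10, 11, 55])  # one extreme value

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Points beyond 1.5*IQR from the quartiles are what the boxplot draws as dots
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(len(outliers))
```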

Get the target column distribution

In [181]:
vechiledata['class'].unique()
Out[181]:
[van, car, bus]
Categories (3, object): [van, car, bus]
In [182]:
vechiledata['class'].value_counts()
Out[182]:
car    429
bus    218
van    199
Name: class, dtype: int64
In [183]:
sns.countplot(vechiledata['class']);
In [184]:
groupby=vechiledata.groupby('class')
groupby.mean()
Out[184]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
class
bus 91.591743 44.963303 76.811927 165.720183 63.403670 7.013761 170.022936 40.114679 20.577982 146.701835 192.889908 448.894495 180.963303 77.071101 4.816514 10.211009 187.811927 191.325688
car 96.184149 46.030303 88.878788 180.496503 60.993007 8.825175 180.997669 38.104895 21.508159 149.967366 197.806527 499.904429 179.613054 69.935897 7.121212 15.160839 189.470862 197.582751
van 90.562814 42.070352 73.281407 147.276382 61.261307 9.713568 141.537688 47.939698 18.582915 145.175879 164.040201 298.201005 157.276382 72.778894 6.417085 9.698492 188.939698 196.145729
In [185]:
#importing the Encoding library
from sklearn.preprocessing import LabelEncoder

#Encoding of categorical variables
labelencoder_X=LabelEncoder()
vechiledata['class']=labelencoder_X.fit_transform(vechiledata['class'])
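Note that LabelEncoder assigns integer codes in alphabetical order of the class names, so here bus→0, car→1, van→2. A quick sketch confirming the mapping on a hypothetical label list:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['van', 'car', 'bus', 'car'])
# classes_ holds the labels in sorted (alphabetical) order, and the
# code for each sample is its index into classes_
print(list(le.classes_))  # ['bus', 'car', 'van']
print(list(codes))        # [2, 1, 0, 1]
```

Keeping this mapping in mind matters when labelling plots of the encoded column.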
In [187]:
# kde plots to show the distribution of the all the variables with respect to dependent variable
k=1
plt.figure(figsize=(20,30))
for col in vechiledata.columns[0:18]:
    plt.subplot(5,4,k)
    # LabelEncoder maps classes alphabetically: bus=0, car=1, van=2
    sns.kdeplot(vechiledata[vechiledata['class']==0][col],color='red',label='bus',shade=True)
    sns.kdeplot(vechiledata[vechiledata['class']==1][col],color='blue',label='car',shade=True)
    sns.kdeplot(vechiledata[vechiledata['class']==2][col],color='yellow',label='van',shade=True)
    plt.title(col)
    k=k+1
In [188]:
#correlation matrix
cor=vechiledata.corr()
cor
Out[188]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
compactness 1.000000 0.684887 0.789928 0.689743 0.091534 0.148249 0.812620 -0.788750 0.813694 0.676143 0.762070 0.814012 0.585243 -0.249593 0.236078 0.157015 0.298537 0.365552 -0.033796
circularity 0.684887 1.000000 0.792320 0.620912 0.153778 0.251467 0.847938 -0.821472 0.843400 0.961318 0.796306 0.835946 0.925816 0.051946 0.144198 -0.011439 -0.104426 0.046351 -0.158910
distance_circularity 0.789928 0.792320 1.000000 0.767035 0.158456 0.264686 0.905076 -0.911307 0.893025 0.774527 0.861519 0.886017 0.705771 -0.225944 0.113924 0.265547 0.146098 0.332732 -0.064467
radius_ratio 0.689743 0.620912 0.767035 1.000000 0.663447 0.450052 0.734429 -0.789481 0.708385 0.568949 0.793415 0.718436 0.536372 -0.180397 0.048713 0.173741 0.382214 0.471309 -0.182186
pr.axis_aspect_ratio 0.091534 0.153778 0.158456 0.663447 1.000000 0.648724 0.103732 -0.183035 0.079604 0.126909 0.272910 0.089189 0.121971 0.152950 -0.058371 -0.031976 0.239886 0.267725 -0.098178
max.length_aspect_ratio 0.148249 0.251467 0.264686 0.450052 0.648724 1.000000 0.166191 -0.180140 0.161502 0.305943 0.318957 0.143253 0.189743 0.295735 0.015599 0.043422 -0.026081 0.143919 0.207619
scatter_ratio 0.812620 0.847938 0.905076 0.734429 0.103732 0.166191 1.000000 -0.971601 0.989751 0.809083 0.948662 0.993012 0.799875 -0.027542 0.074458 0.212428 0.005628 0.118817 -0.288895
elongatedness -0.788750 -0.821472 -0.911307 -0.789481 -0.183035 -0.180140 -0.971601 1.000000 -0.948996 -0.775854 -0.936382 -0.953816 -0.766314 0.103302 -0.052600 -0.185053 -0.115126 -0.216905 0.339344
pr.axis_rectangularity 0.813694 0.843400 0.893025 0.708385 0.079604 0.161502 0.989751 -0.948996 1.000000 0.810934 0.934227 0.988213 0.796690 -0.015495 0.083767 0.214700 -0.018649 0.099286 -0.258481
max.length_rectangularity 0.676143 0.961318 0.774527 0.568949 0.126909 0.305943 0.809083 -0.775854 0.810934 1.000000 0.744985 0.794615 0.866450 0.041622 0.135852 0.001366 -0.103948 0.076770 -0.032399
scaled_variance 0.762070 0.796306 0.861519 0.793415 0.272910 0.318957 0.948662 -0.936382 0.934227 0.744985 1.000000 0.945678 0.778917 0.113078 0.036729 0.194239 0.014219 0.085695 -0.312943
scaled_variance.1 0.814012 0.835946 0.886017 0.718436 0.089189 0.143253 0.993012 -0.953816 0.988213 0.794615 0.945678 1.000000 0.795017 -0.015401 0.076877 0.200811 0.006219 0.102935 -0.288115
scaled_radius_of_gyration 0.585243 0.925816 0.705771 0.536372 0.121971 0.189743 0.799875 -0.766314 0.796690 0.866450 0.778917 0.795017 1.000000 0.191473 0.166483 -0.056153 -0.224450 -0.118002 -0.250267
scaled_radius_of_gyration.1 -0.249593 0.051946 -0.225944 -0.180397 0.152950 0.295735 -0.027542 0.103302 -0.015495 0.041622 0.113078 -0.015401 0.191473 1.000000 -0.088355 -0.126183 -0.748865 -0.802123 -0.212601
skewness_about 0.236078 0.144198 0.113924 0.048713 -0.058371 0.015599 0.074458 -0.052600 0.083767 0.135852 0.036729 0.076877 0.166483 -0.088355 1.000000 -0.034990 0.115297 0.097126 0.119581
skewness_about.1 0.157015 -0.011439 0.265547 0.173741 -0.031976 0.043422 0.212428 -0.185053 0.214700 0.001366 0.194239 0.200811 -0.056153 -0.126183 -0.034990 1.000000 0.077310 0.204990 -0.010680
skewness_about.2 0.298537 -0.104426 0.146098 0.382214 0.239886 -0.026081 0.005628 -0.115126 -0.018649 -0.103948 0.014219 0.006219 -0.224450 -0.748865 0.115297 0.077310 1.000000 0.892581 0.067244
hollows_ratio 0.365552 0.046351 0.332732 0.471309 0.267725 0.143919 0.118817 -0.216905 0.099286 0.076770 0.085695 0.102935 -0.118002 -0.802123 0.097126 0.204990 0.892581 1.000000 0.235874
class -0.033796 -0.158910 -0.064467 -0.182186 -0.098178 0.207619 -0.288895 0.339344 -0.258481 -0.032399 -0.312943 -0.288115 -0.250267 -0.212601 0.119581 -0.010680 0.067244 0.235874 1.000000
In [189]:
#heatmap of the correlation matrix
plt.subplots(figsize=(10,8))
sns.heatmap(cor,annot=True,linewidths=.5,center=0,cbar=False,cmap="YlGnBu");
In [190]:
#Pair plot that includes all the columns of the data frame
sns.pairplot(vechiledata,hue='class');

3. Split the data into train and test (Suggestion: specify “random state” if you are using train_test_split from Sklearn)

In [191]:
## Creating a copy of the dataframe for manipulation
vechiledata_split = vechiledata.copy()
vechiledata_split.head()
Out[191]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 2
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 2
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 1
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 2
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 0
In [192]:
##Importing training and test set split
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
In [193]:
## Define X and y variables
X = vechiledata_split.drop('class',axis=1)
y = vechiledata_split['class']
In [194]:
#importing the zscore for scaling
from scipy.stats import zscore
XScaled=X.apply(zscore)
XScaled.head()
Out[194]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 0.160580 0.518073 0.057177 0.273363 1.310398 0.311542 -0.207598 0.136262 -0.224342 0.758332 -0.401920 -0.341934 0.285705 -0.327326 -0.073812 0.380870 -0.312012 0.183957
1 -0.325470 -0.623732 0.120741 -0.835032 -0.593753 0.094079 -0.599423 0.520519 -0.610886 -0.344578 -0.593357 -0.619724 -0.513630 -0.059384 0.538390 0.156798 0.013265 0.452977
2 1.254193 0.844303 1.519141 1.202018 0.548738 0.311542 1.148719 -1.144597 0.935290 0.689401 1.097671 1.109379 1.392477 0.074587 1.558727 -0.403383 -0.149374 0.049447
3 -0.082445 -0.623732 -0.006386 -0.295813 0.167907 0.094079 -0.750125 0.648605 -0.610886 -0.344578 -0.912419 -0.738777 -1.466683 -1.265121 -0.073812 -0.291347 1.639649 1.529056
4 -1.054545 -0.134387 -0.769150 1.082192 5.245643 9.444962 -0.599423 0.520519 -0.610886 -0.275646 1.671982 -0.648070 0.408680 7.309005 0.538390 -0.179311 -1.450481 -1.699181
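scipy's zscore (with its default ddof=0) gives the same result as sklearn's StandardScaler, which is the more common choice inside sklearn pipelines. A sketch on random data (not the vehicle features):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])

# Both subtract the column mean and divide by the population std (ddof=0)
via_zscore = X.apply(zscore)
via_scaler = StandardScaler().fit_transform(X)
print(np.allclose(via_zscore.values, via_scaler))  # True
```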
In [195]:
#splitting the data in a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(XScaled, y, test_size=0.30, random_state=10)
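Since the classes are imbalanced (429 cars vs 199 vans), passing stratify=y keeps the class proportions the same in train and test. A sketch on synthetic labels (the 60/30/10 split below is illustrative, not the actual class counts):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels standing in for the vehicle classes
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=10, stratify=y)

# Each class keeps its 60/30/10 ratio in both halves
print(np.bincount(y_te))  # [18  9  3]
```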

4. Train a Support Vector Machine using the train set and get the accuracy on the test set

In [196]:
# Import the metrics
from sklearn import metrics
# Import Support Vector Classifier machine learning library
from sklearn.svm import SVC
from sklearn.metrics import classification_report
In [197]:
# Building a Support Vector Machine on train data
svc_model = SVC()
svc_model.fit(X_train, y_train)

ypred_SVM = svc_model.predict(X_test)
In [198]:
# check the accuracy on the train and test sets
print('Accuracy of SVM model on train set: {:.2f}'.format(svc_model.score(X_train, y_train)))
print('Accuracy of SVM model on test set: {:.2f}'.format(svc_model.score(X_test, y_test)))
Accuracy of SVM model on train set: 0.97
Accuracy of SVM model on test set: 0.96
In [199]:
print(classification_report(y_test, ypred_SVM))
              precision    recall  f1-score   support

           0       1.00      0.96      0.98        71
           1       0.98      0.96      0.97       125
           2       0.89      0.97      0.93        58

    accuracy                           0.96       254
   macro avg       0.95      0.96      0.96       254
weighted avg       0.96      0.96      0.96       254
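The classification report can be complemented with a confusion matrix, which shows exactly which classes get confused with which. A sketch with hypothetical labels (not the model's actual predictions):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2]

# Rows are true classes, columns are predicted classes; the single
# off-diagonal 1 is a class-1 sample predicted as class 2
cm = confusion_matrix(y_true, y_pred)
print(cm)
```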

5. Perform K-fold cross validation and get the cross validation score of the model

In [200]:
# Importing libraries KFold,cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
In [201]:
num_folds = 10
seed = 10

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
model = SVC()
results = cross_val_score(model, XScaled, y, cv=kfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
[0.95294118 0.94117647 0.97647059 0.94117647 0.94117647 0.95294118
 0.96428571 0.98809524 0.98809524 0.96428571]
Accuracy: 96.106% (1.743%)
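With imbalanced classes, StratifiedKFold is often preferred over plain KFold so that each fold mirrors the overall class distribution. A sketch on synthetic two-class data (not the vehicle features):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 4))
y = np.array([0] * 60 + [1] * 40)
X[y == 1] += 2.0  # shift class-1 points so there is signal to learn

# Each of the 5 folds keeps roughly the 60/40 class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
scores = cross_val_score(SVC(), X, y, cv=skf)
print(len(scores), scores.mean() > 0.8)
```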

6. Use PCA from Scikit learn, extract Principal Components that capture about 95% of the variance in the data

In [202]:
#Importing PCA for dimensionality reduction and visualization
from sklearn.decomposition import PCA
In [203]:
#covariance matrix of the scaled data
covMatrix = np.cov(XScaled,rowvar=False)
print(covMatrix)
[[ 1.00118343  0.68569786  0.79086299  0.69055952  0.09164265  0.14842463
   0.81358214 -0.78968322  0.81465658  0.67694334  0.76297234  0.81497566
   0.58593517 -0.24988794  0.23635777  0.15720044  0.29889034  0.36598446]
 [ 0.68569786  1.00118343  0.79325751  0.6216467   0.15396023  0.25176438
   0.8489411  -0.82244387  0.84439802  0.96245572  0.79724837  0.83693508
   0.92691166  0.05200785  0.14436828 -0.01145212 -0.10455005  0.04640562]
 [ 0.79086299  0.79325751  1.00118343  0.76794246  0.15864319  0.26499957
   0.90614687 -0.9123854   0.89408198  0.77544391  0.86253904  0.88706577
   0.70660663 -0.22621115  0.1140589   0.26586088  0.14627113  0.33312625]
 [ 0.69055952  0.6216467   0.76794246  1.00118343  0.66423242  0.45058426
   0.73529816 -0.79041561  0.70922371  0.56962256  0.79435372  0.71928618
   0.53700678 -0.18061084  0.04877032  0.17394649  0.38266622  0.47186659]
 [ 0.09164265  0.15396023  0.15864319  0.66423242  1.00118343  0.64949139
   0.10385472 -0.18325156  0.07969786  0.1270594   0.27323306  0.08929427
   0.12211524  0.15313091 -0.05843967 -0.0320139   0.24016968  0.26804208]
 [ 0.14842463  0.25176438  0.26499957  0.45058426  0.64949139  1.00118343
   0.16638787 -0.18035326  0.16169312  0.30630475  0.31933428  0.1434227
   0.18996732  0.29608463  0.01561769  0.04347324 -0.02611148  0.14408905]
 [ 0.81358214  0.8489411   0.90614687  0.73529816  0.10385472  0.16638787
   1.00118343 -0.97275069  0.99092181  0.81004084  0.94978498  0.9941867
   0.80082111 -0.02757446  0.07454578  0.21267959  0.00563439  0.1189581 ]
 [-0.78968322 -0.82244387 -0.9123854  -0.79041561 -0.18325156 -0.18035326
  -0.97275069  1.00118343 -0.95011894 -0.77677186 -0.93748998 -0.95494487
  -0.76722075  0.10342428 -0.05266193 -0.18527244 -0.11526213 -0.2171615 ]
 [ 0.81465658  0.84439802  0.89408198  0.70922371  0.07969786  0.16169312
   0.99092181 -0.95011894  1.00118343  0.81189327  0.93533261  0.98938264
   0.79763248 -0.01551372  0.08386628  0.21495454 -0.01867064  0.09940372]
 [ 0.67694334  0.96245572  0.77544391  0.56962256  0.1270594   0.30630475
   0.81004084 -0.77677186  0.81189327  1.00118343  0.74586628  0.79555492
   0.86747579  0.04167099  0.13601231  0.00136727 -0.10407076  0.07686047]
 [ 0.76297234  0.79724837  0.86253904  0.79435372  0.27323306  0.31933428
   0.94978498 -0.93748998  0.93533261  0.74586628  1.00118343  0.94679667
   0.77983844  0.11321163  0.03677248  0.19446837  0.01423606  0.08579656]
 [ 0.81497566  0.83693508  0.88706577  0.71928618  0.08929427  0.1434227
   0.9941867  -0.95494487  0.98938264  0.79555492  0.94679667  1.00118343
   0.79595778 -0.01541878  0.07696823  0.20104818  0.00622636  0.10305714]
 [ 0.58593517  0.92691166  0.70660663  0.53700678  0.12211524  0.18996732
   0.80082111 -0.76722075  0.79763248  0.86747579  0.77983844  0.79595778
   1.00118343  0.19169941  0.16667971 -0.05621953 -0.22471583 -0.11814142]
 [-0.24988794  0.05200785 -0.22621115 -0.18061084  0.15313091  0.29608463
  -0.02757446  0.10342428 -0.01551372  0.04167099  0.11321163 -0.01541878
   0.19169941  1.00118343 -0.08846001 -0.12633227 -0.749751   -0.80307227]
 [ 0.23635777  0.14436828  0.1140589   0.04877032 -0.05843967  0.01561769
   0.07454578 -0.05266193  0.08386628  0.13601231  0.03677248  0.07696823
   0.16667971 -0.08846001  1.00118343 -0.03503155  0.1154338   0.09724079]
 [ 0.15720044 -0.01145212  0.26586088  0.17394649 -0.0320139   0.04347324
   0.21267959 -0.18527244  0.21495454  0.00136727  0.19446837  0.20104818
  -0.05621953 -0.12633227 -0.03503155  1.00118343  0.07740174  0.20523257]
 [ 0.29889034 -0.10455005  0.14627113  0.38266622  0.24016968 -0.02611148
   0.00563439 -0.11526213 -0.01867064 -0.10407076  0.01423606  0.00622636
  -0.22471583 -0.749751    0.1154338   0.07740174  1.00118343  0.89363767]
 [ 0.36598446  0.04640562  0.33312625  0.47186659  0.26804208  0.14408905
   0.1189581  -0.2171615   0.09940372  0.07686047  0.08579656  0.10305714
  -0.11814142 -0.80307227  0.09724079  0.20523257  0.89363767  1.00118343]]
In [204]:
#Finding eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(covMatrix)
print('Eigen Vectors \n%s' % eigenvectors)
print('\n Eigen Values \n%s' % eigenvalues)
Eigen Vectors 
[[ 2.75283688e-01  1.26953763e-01  1.19922479e-01 -7.83843562e-02
  -6.95178336e-02  1.44875476e-01  4.51862331e-01  5.66136785e-01
   4.84418105e-01  2.60076393e-01 -4.65342885e-02  1.20344026e-02
  -1.56136836e-01 -1.00728764e-02 -6.00532537e-03  6.00485194e-02
  -6.50956666e-02 -9.67780251e-03]
 [ 2.93258469e-01 -1.25576727e-01  2.48205467e-02 -1.87337408e-01
   8.50649539e-02 -3.02731148e-01 -2.49103387e-01  1.79851809e-01
   1.41569001e-02 -9.80779086e-02 -3.01323693e-03 -2.13635088e-01
  -1.50116709e-02 -9.15939674e-03  7.38059396e-02 -4.26993118e-01
  -2.61244802e-01 -5.97862837e-01]
 [ 3.04609128e-01  7.29516436e-02  5.60143254e-02  7.12008427e-02
  -4.06645651e-02 -1.38405773e-01  7.40350569e-02 -4.34748988e-01
   1.67572478e-01  2.05031597e-01 -7.06489498e-01  3.46330345e-04
   2.37111452e-01  6.94599696e-03 -2.50791236e-02  1.46240270e-01
   7.82651714e-02 -1.57257142e-01]
 [ 2.67606877e-01  1.89634378e-01 -2.75074211e-01  4.26053415e-02
   4.61473714e-02  2.48136636e-01 -1.76912814e-01 -1.01998360e-01
   2.30313563e-01  4.77888949e-02  1.07151583e-01 -1.57049977e-01
   3.07818692e-02 -4.20156482e-02 -3.59880417e-02 -5.21374718e-01
   5.60792139e-01  1.66551725e-01]
 [ 8.05039890e-02  1.22174860e-01 -6.42012966e-01 -3.27257119e-02
   4.05494487e-02  2.36932611e-01 -3.97876601e-01  6.87147927e-02
   2.77128307e-01 -1.08075009e-01 -3.85169721e-02  1.10106595e-01
   3.92804479e-02  3.12698087e-02  1.25847434e-02  3.63120360e-01
  -3.22276873e-01 -6.36138719e-02]
 [ 9.72756855e-02 -1.07482875e-02 -5.91801304e-01 -3.14147277e-02
  -2.13432566e-01 -4.19330747e-01  5.03413610e-01 -1.61153097e-01
  -1.48032250e-01  1.18266345e-01  2.62254132e-01 -1.32935328e-01
  -3.72884301e-02 -9.99915816e-03 -2.84168792e-02  6.27796802e-02
   4.87809642e-02 -8.63169844e-02]
 [ 3.17092750e-01 -4.81181371e-02  9.76283108e-02  9.57485748e-02
   1.54853055e-02  1.16100153e-01  6.49879382e-02 -1.00688056e-01
  -5.44574214e-02 -1.65167200e-01  1.70405800e-01  9.55883216e-02
  -3.94638419e-02  8.40975659e-01 -2.49652703e-01  6.40502241e-02
   1.81839668e-02 -7.98693109e-02]
 [-3.14133155e-01 -1.27498515e-02 -5.76484384e-02 -8.22901952e-02
  -7.68518712e-02 -1.41840112e-01  1.38112945e-02  2.15497166e-01
   1.56867362e-01  1.51612333e-01  5.76632611e-02  1.22012715e-01
   8.10394855e-01  2.38188639e-01 -4.21478467e-02 -1.86946145e-01
  -2.50330194e-02  4.21515054e-02]
 [ 3.13959064e-01 -5.99352482e-02  1.09512416e-01  9.24582989e-02
  -2.17633157e-03  9.80561329e-02  9.66573058e-02 -6.35933915e-02
  -5.24978759e-03 -1.93777917e-01  2.72514033e-01  2.51281206e-01
   2.71573184e-01 -1.01154594e-01  7.17396292e-01  1.80912790e-01
   1.64490784e-01 -1.44490635e-01]
 [ 2.82830900e-01 -1.16220532e-01  1.70641987e-02 -1.88005612e-01
   6.06366845e-02 -4.61674972e-01 -1.04552173e-01  2.49495867e-01
   6.10362445e-02 -4.69059999e-01 -1.41434233e-01 -1.24529334e-01
   7.57105808e-02 -1.69481636e-02 -4.70233017e-02  1.74070296e-01
   1.47280090e-01  5.11259153e-01]
 [ 3.09280359e-01 -6.22806229e-02 -5.63239801e-02  1.19844008e-01
   4.56472367e-04  2.36225434e-01  1.14622578e-01 -5.02096319e-02
  -2.97588112e-01  1.29986011e-01 -7.72596638e-02 -2.15011644e-01
   1.53180808e-01  6.04665108e-03  1.71503771e-01 -2.77272123e-01
  -5.64444637e-01  4.53236855e-01]
 [ 3.13788457e-01 -5.37843596e-02  1.08840729e-01  9.17449325e-02
   1.95548315e-02  1.57820194e-01  8.37350220e-02 -4.37649907e-02
  -8.33669838e-02 -1.58203940e-01  2.43226301e-01  1.75685051e-01
   3.07948154e-01 -4.69202757e-01 -6.16589383e-01  7.85141734e-02
  -6.85856929e-02 -1.26992250e-01]
 [ 2.72047492e-01 -2.09233172e-01  3.14636493e-02 -2.00095228e-01
   6.15991681e-02 -1.35576278e-01 -3.73944382e-01  1.08474496e-01
  -2.41655483e-01  6.86493700e-01  1.58888394e-01  1.90336498e-01
  -3.76087492e-02  1.17483082e-02 -2.64910290e-02  2.00683948e-01
   1.47099233e-01  1.09982525e-01]
 [-2.08137692e-02 -4.88525148e-01 -2.86277015e-01  6.55051354e-02
  -1.45530146e-01  2.41356821e-01  1.11952983e-01  3.40878491e-01
  -3.20221887e-01 -1.27648385e-01 -4.19188664e-01  2.85710601e-01
  -4.34650674e-02  3.14812146e-03 -1.42959461e-02 -1.46861607e-01
   2.32941262e-01 -1.11271959e-01]
 [ 4.14555082e-02  5.50899716e-02  1.15679354e-01 -6.04794251e-01
  -7.29189842e-01  2.03209257e-01 -8.06328902e-02 -1.56487670e-01
  -2.21054148e-02 -9.83643219e-02  1.25447648e-02 -1.60327156e-03
  -9.94304634e-03 -3.03156233e-03  1.74310271e-03 -1.73360301e-02
  -2.77589170e-02  2.40943096e-02]
 [ 5.82250207e-02  1.24085090e-01  7.52828901e-02  6.66114117e-01
  -5.99196401e-01 -1.91960802e-01 -2.84558723e-01  2.08774083e-01
  -1.01761758e-02  3.55150608e-02  3.27808069e-02 -8.32589542e-02
  -2.68915150e-02 -1.25315953e-02 -7.08894692e-03  3.13689218e-02
   2.78187408e-03 -9.89651885e-03]
 [ 3.02795063e-02  5.40914775e-01 -8.73592034e-03 -1.05526253e-01
   1.00602332e-01  1.56939174e-01  1.81451818e-02  3.04580219e-01
  -5.17222779e-01 -1.93956186e-02 -1.20597635e-01 -3.53723696e-01
   1.86595152e-01  4.34282436e-02  7.67874680e-03  2.31451048e-01
   1.90629960e-01 -1.82212045e-01]
 [ 7.41453913e-02  5.40354258e-01 -3.95242743e-02 -4.74890311e-02
   2.98614819e-02 -2.41222817e-01  1.57237839e-02  3.04186304e-02
  -1.71506343e-01 -6.41314778e-02 -9.19597847e-02  6.85618161e-01
  -1.42380007e-01 -6.47700819e-03  6.37681817e-03 -2.88502234e-01
  -1.20966490e-01  9.04014702e-02]]

 Eigen Values 
[9.40460261e+00 3.01492206e+00 1.90352502e+00 1.17993747e+00
 9.17260633e-01 5.39992629e-01 3.58870118e-01 2.21932456e-01
 1.60608597e-01 9.18572234e-02 6.64994118e-02 4.66005994e-02
 3.57947189e-02 2.96445743e-03 1.00257898e-02 2.74120657e-02
 1.79166314e-02 2.05792871e-02]
In [205]:
# Make a list of (eigenvalue, eigenvector) pairs, sorted by eigenvalue in descending order
eigen_pairs = [(np.abs(eigenvalues[i]), eigenvectors[:,i]) for i in range(len(eigenvalues))]
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)
eigen_pairs[:]
Out[205]:
[(9.404602609088705,
  array([ 0.27528369,  0.29325847,  0.30460913,  0.26760688,  0.08050399,
          0.09727569,  0.31709275, -0.31413315,  0.31395906,  0.2828309 ,
          0.30928036,  0.31378846,  0.27204749, -0.02081377,  0.04145551,
          0.05822502,  0.03027951,  0.07414539])),
 (3.014922058524633,
  array([ 0.12695376, -0.12557673,  0.07295164,  0.18963438,  0.12217486,
         -0.01074829, -0.04811814, -0.01274985, -0.05993525, -0.11622053,
         -0.06228062, -0.05378436, -0.20923317, -0.48852515,  0.05508997,
          0.12408509,  0.54091477,  0.54035426])),
 (1.9035250218389657,
  array([ 0.11992248,  0.02482055,  0.05601433, -0.27507421, -0.64201297,
         -0.5918013 ,  0.09762831, -0.05764844,  0.10951242,  0.0170642 ,
         -0.05632398,  0.10884073,  0.03146365, -0.28627701,  0.11567935,
          0.07528289, -0.00873592, -0.03952427])),
 (1.1799374684450215,
  array([-0.07838436, -0.18733741,  0.07120084,  0.04260534, -0.03272571,
         -0.03141473,  0.09574857, -0.0822902 ,  0.0924583 , -0.18800561,
          0.11984401,  0.09174493, -0.20009523,  0.06550514, -0.60479425,
          0.66611412, -0.10552625, -0.04748903])),
 (0.9172606328594372,
  array([-6.95178336e-02,  8.50649539e-02, -4.06645651e-02,  4.61473714e-02,
          4.05494487e-02, -2.13432566e-01,  1.54853055e-02, -7.68518712e-02,
         -2.17633157e-03,  6.06366845e-02,  4.56472367e-04,  1.95548315e-02,
          6.15991681e-02, -1.45530146e-01, -7.29189842e-01, -5.99196401e-01,
          1.00602332e-01,  2.98614819e-02])),
 (0.5399926288001127,
  array([ 0.14487548, -0.30273115, -0.13840577,  0.24813664,  0.23693261,
         -0.41933075,  0.11610015, -0.14184011,  0.09805613, -0.46167497,
          0.23622543,  0.15782019, -0.13557628,  0.24135682,  0.20320926,
         -0.1919608 ,  0.15693917, -0.24122282])),
 (0.3588701179293984,
  array([ 0.45186233, -0.24910339,  0.07403506, -0.17691281, -0.3978766 ,
          0.50341361,  0.06498794,  0.01381129,  0.09665731, -0.10455217,
          0.11462258,  0.08373502, -0.37394438,  0.11195298, -0.08063289,
         -0.28455872,  0.01814518,  0.01572378])),
 (0.2219324559989345,
  array([ 0.56613679,  0.17985181, -0.43474899, -0.10199836,  0.06871479,
         -0.1611531 , -0.10068806,  0.21549717, -0.06359339,  0.24949587,
         -0.05020963, -0.04376499,  0.1084745 ,  0.34087849, -0.15648767,
          0.20877408,  0.30458022,  0.03041863])),
 (0.16060859663511767,
  array([ 0.4844181 ,  0.0141569 ,  0.16757248,  0.23031356,  0.27712831,
         -0.14803225, -0.05445742,  0.15686736, -0.00524979,  0.06103624,
         -0.29758811, -0.08336698, -0.24165548, -0.32022189, -0.02210541,
         -0.01017618, -0.51722278, -0.17150634])),
 (0.09185722339516111,
  array([ 0.26007639, -0.09807791,  0.2050316 ,  0.04778889, -0.10807501,
          0.11826635, -0.1651672 ,  0.15161233, -0.19377792, -0.46906   ,
          0.12998601, -0.15820394,  0.6864937 , -0.12764838, -0.09836432,
          0.03551506, -0.01939562, -0.06413148])),
 (0.06649941176460208,
  array([-0.04653429, -0.00301324, -0.7064895 ,  0.10715158, -0.03851697,
          0.26225413,  0.1704058 ,  0.05766326,  0.27251403, -0.14143423,
         -0.07725966,  0.2432263 ,  0.15888839, -0.41918866,  0.01254476,
          0.03278081, -0.12059763, -0.09195978])),
 (0.04660059944187704,
  array([ 1.20344026e-02, -2.13635088e-01,  3.46330345e-04, -1.57049977e-01,
          1.10106595e-01, -1.32935328e-01,  9.55883216e-02,  1.22012715e-01,
          2.51281206e-01, -1.24529334e-01, -2.15011644e-01,  1.75685051e-01,
          1.90336498e-01,  2.85710601e-01, -1.60327156e-03, -8.32589542e-02,
         -3.53723696e-01,  6.85618161e-01])),
 (0.03579471891303873,
  array([-0.15613684, -0.01501167,  0.23711145,  0.03078187,  0.03928045,
         -0.03728843, -0.03946384,  0.81039486,  0.27157318,  0.07571058,
          0.15318081,  0.30794815, -0.03760875, -0.04346507, -0.00994305,
         -0.02689151,  0.18659515, -0.14238001])),
 (0.027412065737195113,
  array([ 0.06004852, -0.42699312,  0.14624027, -0.52137472,  0.36312036,
          0.06277968,  0.06405022, -0.18694615,  0.18091279,  0.1740703 ,
         -0.27727212,  0.07851417,  0.20068395, -0.14686161, -0.01733603,
          0.03136892,  0.23145105, -0.28850223])),
 (0.020579287070888724,
  array([-0.0096778 , -0.59786284, -0.15725714,  0.16655173, -0.06361387,
         -0.08631698, -0.07986931,  0.04215151, -0.14449063,  0.51125915,
          0.45323685, -0.12699225,  0.10998252, -0.11127196,  0.02409431,
         -0.00989652, -0.18221204,  0.09040147])),
 (0.01791663143223643,
  array([-0.06509567, -0.2612448 ,  0.07826517,  0.56079214, -0.32227687,
          0.04878096,  0.01818397, -0.02503302,  0.16449078,  0.14728009,
         -0.56444464, -0.06858569,  0.14709923,  0.23294126, -0.02775892,
          0.00278187,  0.19062996, -0.12096649])),
 (0.010025789847555906,
  array([-0.00600533,  0.07380594, -0.02507912, -0.03598804,  0.01258474,
         -0.02841688, -0.2496527 , -0.04214785,  0.71739629, -0.0470233 ,
          0.17150377, -0.61658938, -0.02649103, -0.01429595,  0.0017431 ,
         -0.00708895,  0.00767875,  0.00637682])),
 (0.002964457425044782,
  array([-0.01007288, -0.0091594 ,  0.006946  , -0.04201565,  0.03126981,
         -0.00999916,  0.84097566,  0.23818864, -0.10115459, -0.01694816,
          0.00604665, -0.46920276,  0.01174831,  0.00314812, -0.00303156,
         -0.0125316 ,  0.04342824, -0.00647701]))]
In [206]:
# print the eigenvalues (np.linalg.eig does not return them sorted;
# they are sorted before computing explained variance below)
print('Eigenvalues: \n%s' % eigenvalues)
Eigenvalues: 
[9.40460261e+00 3.01492206e+00 1.90352502e+00 1.17993747e+00
 9.17260633e-01 5.39992629e-01 3.58870118e-01 2.21932456e-01
 1.60608597e-01 9.18572234e-02 6.64994118e-02 4.66005994e-02
 3.57947189e-02 2.96445743e-03 1.00257898e-02 2.74120657e-02
 1.79166314e-02 2.05792871e-02]
In [207]:
tot = sum(eigenvalues)
var_exp = [( i /tot ) * 100 for i in sorted(eigenvalues, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
plt.plot(var_exp)
Cumulative Variance Explained [ 52.18603365  68.9158021   79.47844095  86.02590063  91.11576952
  94.11218252  96.10354875  97.33504945  98.22626473  98.73597943
  99.10498391  99.36357011  99.5621946   99.71430385  99.82849808
  99.92791726  99.98355026 100.        ]
Out[207]:
[<matplotlib.lines.Line2D at 0x1b4864c8>]
In [208]:
# Plotting individual and cumulative explained variance
plt.figure(figsize=(8 , 7))
plt.bar(range(1, eigenvalues.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eigenvalues.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
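The component count read off the plots above can also be automated: passing a float between 0 and 1 as `n_components` tells scikit-learn's `PCA` to keep just enough components to reach that fraction of explained variance. A minimal sketch on synthetic data (not the vehicle dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 4))
# build 8 correlated features out of 4 latent ones, plus a little noise
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 4))])
Xs = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)   # keep enough components for >= 95% variance
Xr = pca.fit_transform(Xs)
print(Xr.shape[1], pca.explained_variance_ratio_.sum())
```

Since only 4 latent directions drive the 8 features here, far fewer than 8 components are kept.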

7. Repeat steps 3, 4 and 5, but this time use the Principal Components instead of the original data. The accuracy score should be computed on the same rows of test data that were used earlier

In [209]:
# let's fit PCA with 7 components
pca3 = PCA(n_components=7)
pca3.fit(XScaled)
print(pca3.components_)
print(pca3.explained_variance_ratio_)
Xpca3 = pca3.transform(XScaled)
[[ 2.75283688e-01  2.93258469e-01  3.04609128e-01  2.67606877e-01
   8.05039890e-02  9.72756855e-02  3.17092750e-01 -3.14133155e-01
   3.13959064e-01  2.82830900e-01  3.09280359e-01  3.13788457e-01
   2.72047492e-01 -2.08137692e-02  4.14555082e-02  5.82250207e-02
   3.02795063e-02  7.41453913e-02]
 [-1.26953763e-01  1.25576727e-01 -7.29516436e-02 -1.89634378e-01
  -1.22174860e-01  1.07482875e-02  4.81181371e-02  1.27498515e-02
   5.99352482e-02  1.16220532e-01  6.22806229e-02  5.37843596e-02
   2.09233172e-01  4.88525148e-01 -5.50899716e-02 -1.24085090e-01
  -5.40914775e-01 -5.40354258e-01]
 [-1.19922479e-01 -2.48205467e-02 -5.60143254e-02  2.75074211e-01
   6.42012966e-01  5.91801304e-01 -9.76283108e-02  5.76484384e-02
  -1.09512416e-01 -1.70641987e-02  5.63239801e-02 -1.08840729e-01
  -3.14636493e-02  2.86277015e-01 -1.15679354e-01 -7.52828901e-02
   8.73592034e-03  3.95242743e-02]
 [ 7.83843562e-02  1.87337408e-01 -7.12008427e-02 -4.26053415e-02
   3.27257119e-02  3.14147277e-02 -9.57485748e-02  8.22901952e-02
  -9.24582989e-02  1.88005612e-01 -1.19844008e-01 -9.17449325e-02
   2.00095228e-01 -6.55051354e-02  6.04794251e-01 -6.66114117e-01
   1.05526253e-01  4.74890311e-02]
 [ 6.95178336e-02 -8.50649539e-02  4.06645651e-02 -4.61473714e-02
  -4.05494487e-02  2.13432566e-01 -1.54853055e-02  7.68518712e-02
   2.17633157e-03 -6.06366845e-02 -4.56472367e-04 -1.95548315e-02
  -6.15991681e-02  1.45530146e-01  7.29189842e-01  5.99196401e-01
  -1.00602332e-01 -2.98614819e-02]
 [ 1.44875476e-01 -3.02731148e-01 -1.38405773e-01  2.48136636e-01
   2.36932611e-01 -4.19330747e-01  1.16100153e-01 -1.41840112e-01
   9.80561329e-02 -4.61674972e-01  2.36225434e-01  1.57820194e-01
  -1.35576278e-01  2.41356821e-01  2.03209257e-01 -1.91960802e-01
   1.56939174e-01 -2.41222817e-01]
 [ 4.51862331e-01 -2.49103387e-01  7.40350569e-02 -1.76912814e-01
  -3.97876601e-01  5.03413610e-01  6.49879382e-02  1.38112945e-02
   9.66573058e-02 -1.04552173e-01  1.14622578e-01  8.37350220e-02
  -3.73944382e-01  1.11952983e-01 -8.06328902e-02 -2.84558723e-01
   1.81451818e-02  1.57237839e-02]]
[0.52186034 0.16729768 0.10562639 0.0654746  0.05089869 0.02996413
 0.01991366]
In [210]:
Xpca3  # PCA-transformed data
Out[210]:
array([[ 3.34162030e-01, -2.19026358e-01,  1.00158417e+00, ...,
         7.93007079e-02, -7.57446693e-01, -9.01124283e-01],
       [-1.59171085e+00, -4.20602982e-01, -3.69033854e-01, ...,
         6.93948582e-01, -5.17161832e-01,  3.78636988e-01],
       [ 3.76932418e+00,  1.95282752e-01,  8.78587404e-02, ...,
         7.31732265e-01,  7.05041037e-01, -3.45837595e-02],
       ...,
       [ 4.80917387e+00, -1.24931049e-03,  5.32333105e-01, ...,
        -1.34423635e+00, -2.17069763e-01,  5.73248962e-01],
       [-3.29409242e+00, -1.00827615e+00, -3.57003198e-01, ...,
         4.27680052e-02, -4.02491279e-01, -2.02405787e-01],
       [-4.76505347e+00,  3.34899728e-01, -5.68136078e-01, ...,
        -5.40510367e-02, -3.35637136e-01,  5.80978683e-02]])
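One property worth knowing about such transformed data (sketched here on synthetic data, not on `Xpca3` itself): PCA scores are mutually uncorrelated, so their covariance matrix is approximately diagonal, with the eigenvalues on the diagonal. This is what removes the multicollinearity discussed later.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# mix 6 independent signals to get strongly correlated features
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
scores = PCA().fit_transform(X)

cov = np.cov(scores, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print(np.abs(off_diag).max())  # near zero: components are uncorrelated
```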
In [211]:
X_train, X_test, y_train, y_test = train_test_split(Xpca3, y, test_size=0.30, random_state=10)
In [212]:
# Building a Support Vector Machine on train data
svc_model = SVC()
svc_model.fit(X_train, y_train)

ypca_SVM = svc_model.predict(X_test)
In [213]:
# check the accuracy on the training and test sets
print('Accuracy of SVM model on train set: {:.2f}'.format(svc_model.score(X_train, y_train)))
print('Accuracy of SVM model on test set: {:.2f}'.format(svc_model.score(X_test, y_test)))
Accuracy of SVM model on train set: 0.94
Accuracy of SVM model on test set: 0.90
In [214]:
print(classification_report(y_test, ypca_SVM))
              precision    recall  f1-score   support

           0       0.98      0.86      0.92        71
           1       0.91      0.91      0.91       125
           2       0.79      0.91      0.85        58

    accuracy                           0.90       254
   macro avg       0.90      0.89      0.89       254
weighted avg       0.90      0.90      0.90       254

In [215]:
num_folds = 10
seed = 10

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
model_pca = SVC()
results = cross_val_score(model_pca, Xpca3, y, cv=kfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
[0.88235294 0.89411765 0.92941176 0.88235294 0.95294118 0.90588235
 0.92857143 0.92857143 0.91666667 0.88095238]
Accuracy: 91.018% (2.366%)

8. Compare the accuracy scores and cross validation scores of Support vector machines – one trained using raw data and the other using Principal Components, and mention your findings

We can say that the SVM model with K-fold cross-validation on all 18 attributes gives slightly better accuracy than the model trained on the 7 principal components.

Applying PCA loses some information, but it reduces the dimensionality by discarding directions that have little impact on the model.

Multicollinearity and the curse of dimensionality adversely impact any machine learning model: as the number of dimensions grows for a fixed-size training dataset, the feature space becomes increasingly sparse, and the model tends to overfit.

Principal Component Analysis helps address these problems and can improve model performance considerably.
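One caveat when comparing the two cross-validation scores above: scaling and PCA were fitted on the full dataset before splitting, which leaks information into the held-out folds. A fairer comparison wraps the preprocessing in a `Pipeline` so each fold refits it on its own training split. A hedged sketch on a synthetic classification problem (the data and model names are illustrative, not the vehicle data):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=18,
                           n_informative=8, random_state=10)
cv = KFold(n_splits=10, shuffle=True, random_state=10)

raw_model = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
pca_model = Pipeline([('scale', StandardScaler()),
                      ('pca', PCA(n_components=7)),
                      ('svc', SVC())])

# scaler and PCA are re-fitted inside every training fold
for name, model in [('raw', raw_model), ('pca', pca_model)]:
    scores = cross_val_score(model, X, y, cv=cv)
    print('%s: %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))
```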